Full Name:

Online Assignments: There is a new online assignment on DataCamp. This online exercise covers data visualization in R, e.g., using the ggplot2 package. These exercises are useful preparation for the computer lab. The online assignments on DataCamp are not mandatory.

Your task is to answer the questions in this R-markdown file. Submit both your R-markdown (.Rmd) file and the HTML file on Canvas.

Note: This week's exercise is worth 100 points. In addition, the Bonus part is worth 30 extra points.

1 Exploratory Data Analysis for customer churn prediction (30 points)

Customer Churn is a topic that matters to organizations of all sizes. Customer churn occurs when customers stop doing business with a company, also known as customer attrition. Churn (loss of customers to competition) is a major problem for telecom companies because it is well known that it is more expensive to acquire a new customer than to keep an existing customer. Here, we use Exploratory Data Analysis to explore the churn dataset. Basically, we want to visualize and identify which factors contribute to customer churn.

Dataset: The churn data set is available in the R package liver. The data set contains 5000 rows (customers) and 20 columns (features). The last column, called churn, is the target variable, which indicates whether customers churned (left the company) or not. If you want to know more about the dataset, type ?churn in your R console. You can also find more information about this dataset here.

Here we need to load the following R packages:

NOTE: If you have not installed these packages, you should first install them.

To load the packages:

library( ggplot2 )  
library( liver   )  
library( GGally  )  
library( psych   )  
library( skimr   )  
library( Hmisc   )
library( plyr    )
library( ggpubr  )

1.1 Business Understanding

Companies are interested in knowing which customers are likely to churn, so they can proactively approach those customers, provide them with better services, and turn their decision around. Companies want to know:

  • What customers are we losing?
  • Why are we losing them?
  • How do we stop them from leaving the company?

To answer these questions here, as a practical example, we use the churn data set, which is available in the R package liver.

1.2 Data Understanding

This dataset comes from IBM Sample Data Sets. The data set contains 5000 rows (customers) and 20 columns (features). The “churn” column is our target which indicates whether the customer churned (left the company) or not. The 20 variables are:

  • state: Categorical, for the 50 states and the District of Columbia.
  • area.code: Categorical.
  • account.length: Count, how long the account has been active.
  • voice.plan: Categorical, yes or no, voice mail plan.
  • voice.messages: Count, number of voice mail messages.
  • intl.plan: Categorical, yes or no, international plan.
  • intl.mins: Continuous, minutes customer used service to make international calls.
  • intl.calls: Count, total number of international calls.
  • intl.charge: Continuous, total international charge.
  • day.mins: Continuous, minutes customer used service during the day.
  • day.calls: Count, total number of calls during the day.
  • day.charge: Continuous, total charge during the day.
  • eve.mins: Continuous, minutes customer used service during the evening.
  • eve.calls: Count, total number of calls during the evening.
  • eve.charge: Continuous, total charge during the evening.
  • night.mins: Continuous, minutes customer used service during the night.
  • night.calls: Count, total number of calls during the night.
  • night.charge: Continuous, total charge during the night.
  • customer.calls: Count, number of calls to customer service.
  • churn: Categorical, yes or no. Indicator of whether the customer has left the company.

We import the dataset in R as follows:

data( churn ) # load the "churn" dataset

To see the overview of the dataset in R we could use the following functions:

  • str to see a compact display of the structure of the data.
  • View to see spreadsheet-style data.
  • head to see the first part of the data (first 6 rows of the data).
  • summary to see the summary of each variable.

To get an overview of the dataset in R, we use the str function as follows:

str( churn )   # Compactly display the structure of the data
  'data.frame': 5000 obs. of  20 variables:
   $ state         : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
   $ area.code     : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
   $ account.length: int  128 107 137 84 75 118 121 147 117 141 ...
   $ voice.plan    : Factor w/ 2 levels "yes","no": 1 1 2 2 2 2 1 2 2 1 ...
   $ voice.messages: int  25 26 0 0 0 0 24 0 0 37 ...
   $ intl.plan     : Factor w/ 2 levels "yes","no": 2 2 2 1 1 1 2 1 2 1 ...
   $ intl.mins     : num  10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
   $ intl.calls    : int  3 3 5 7 3 6 7 6 4 5 ...
   $ intl.charge   : num  2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
   $ day.mins      : num  265 162 243 299 167 ...
   $ day.calls     : int  110 123 114 71 113 98 88 79 97 84 ...
   $ day.charge    : num  45.1 27.5 41.4 50.9 28.3 ...
   $ eve.mins      : num  197.4 195.5 121.2 61.9 148.3 ...
   $ eve.calls     : int  99 103 110 88 122 101 108 94 80 111 ...
   $ eve.charge    : num  16.78 16.62 10.3 5.26 12.61 ...
   $ night.mins    : num  245 254 163 197 187 ...
   $ night.calls   : int  91 103 104 89 121 118 118 96 90 97 ...
   $ night.charge  : num  11.01 11.45 7.32 8.86 8.41 ...
   $ customer.calls: int  1 1 0 2 3 0 3 0 1 0 ...
   $ churn         : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...

It shows that the data are stored as a data.frame object in R with 5000 observations and 20 variables. The last column (named churn) is the target variable, which indicates whether customers churned (left the company) or not.

By using the function summary in R, we can see the summary of the dataset as follows

summary( churn )
       state              area.code    account.length  voice.plan
   WV     : 158   area_code_408:1259   Min.   :  1.0   yes:1323  
   MN     : 125   area_code_415:2495   1st Qu.: 73.0   no :3677  
   AL     : 124   area_code_510:1246   Median :100.0             
   ID     : 119                        Mean   :100.3             
   VA     : 118                        3rd Qu.:127.0             
   OH     : 116                        Max.   :243.0             
   (Other):4240                                                  
   voice.messages   intl.plan    intl.mins       intl.calls      intl.charge   
   Min.   : 0.000   yes: 473   Min.   : 0.00   Min.   : 0.000   Min.   :0.000  
   1st Qu.: 0.000   no :4527   1st Qu.: 8.50   1st Qu.: 3.000   1st Qu.:2.300  
   Median : 0.000              Median :10.30   Median : 4.000   Median :2.780  
   Mean   : 7.755              Mean   :10.26   Mean   : 4.435   Mean   :2.771  
   3rd Qu.:17.000              3rd Qu.:12.00   3rd Qu.: 6.000   3rd Qu.:3.240  
   Max.   :52.000              Max.   :20.00   Max.   :20.000   Max.   :5.400  
                                                                               
      day.mins       day.calls     day.charge       eve.mins       eve.calls    
   Min.   :  0.0   Min.   :  0   Min.   : 0.00   Min.   :  0.0   Min.   :  0.0  
   1st Qu.:143.7   1st Qu.: 87   1st Qu.:24.43   1st Qu.:166.4   1st Qu.: 87.0  
   Median :180.1   Median :100   Median :30.62   Median :201.0   Median :100.0  
   Mean   :180.3   Mean   :100   Mean   :30.65   Mean   :200.6   Mean   :100.2  
   3rd Qu.:216.2   3rd Qu.:113   3rd Qu.:36.75   3rd Qu.:234.1   3rd Qu.:114.0  
   Max.   :351.5   Max.   :165   Max.   :59.76   Max.   :363.7   Max.   :170.0  
                                                                                
     eve.charge      night.mins     night.calls      night.charge   
   Min.   : 0.00   Min.   :  0.0   Min.   :  0.00   Min.   : 0.000  
   1st Qu.:14.14   1st Qu.:166.9   1st Qu.: 87.00   1st Qu.: 7.510  
   Median :17.09   Median :200.4   Median :100.00   Median : 9.020  
   Mean   :17.05   Mean   :200.4   Mean   : 99.92   Mean   : 9.018  
   3rd Qu.:19.90   3rd Qu.:234.7   3rd Qu.:113.00   3rd Qu.:10.560  
   Max.   :30.91   Max.   :395.0   Max.   :175.00   Max.   :17.770  
                                                                    
   customer.calls churn     
   Min.   :0.00   yes: 707  
   1st Qu.:1.00   no :4293  
   Median :1.00             
   Mean   :1.57             
   3rd Qu.:2.00             
   Max.   :9.00             
  

It shows the summary of all 20 variables.

a. For each variable in the churn dataset, specify its type.

Answer a
Variable         R Datatype   Statistical Datatype
state            Factor       categorical - nominal
area.code        Factor       categorical - nominal
account.length   int          numerical - discrete
voice.plan       Factor       categorical - binary
voice.messages   int          numerical - discrete
intl.plan        Factor       categorical - binary
intl.mins        num          numerical - continuous
intl.calls       int          numerical - discrete
intl.charge      num          numerical - continuous
day.mins         num          numerical - continuous
day.calls        int          numerical - discrete
day.charge       num          numerical - continuous
eve.mins         num          numerical - continuous
eve.calls        int          numerical - discrete
eve.charge       num          numerical - continuous
night.mins       num          numerical - continuous
night.calls      int          numerical - discrete
night.charge     num          numerical - continuous
customer.calls   int          numerical - discrete
churn            Factor       categorical - binary
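The R datatypes listed in the table can be checked programmatically. A minimal sketch, assuming the churn data has been loaded with data( churn ) as above:

```r
# Report the R storage class of each column in the churn data frame
sapply( churn, class )
```

Columns of class integer correspond to discrete counts, numeric to continuous measurements, and factor to categorical variables.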

b. Based on the output of the summary function for the churn dataset, what is the number of customers who have an international plan (intl.plan = "yes")?

Answer b
intl_plan_holder = length(which(churn $ intl.plan == "yes"))
intl_plan_holder
  [1] 473

The number of customers who have an international plan is 473.

1.3 Investigate the target variable churn

Here we report a bar plot for the target variable churn by using function ggplot() from the R package ggplot2 as follows:

ggplot( data = churn ) + 
    geom_bar( aes( x = churn ), fill = c( "red", "blue" ) ) +
    labs( title = "Bar plot for the target variable 'churn'" )  

Summary for the target variable churn

summary( churn $ churn )
   yes   no 
   707 4293

Based on the above output, what is the proportion of the churner (customer churn rate)?

Answer

\(n(churn = no) = 4293\)

\(n(churn = yes) = 707\)

\(n = n(churn = no) + n(churn = yes) = 4293 + 707 = 5000\)

\(P(churn = yes) = \frac{n(churn = yes)}{n} = \frac{707}{5000} = 0.1414 = 14.14\%\)

The proportion of churners is \(14.14\%\).
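The same proportion can be computed in R. A quick check in base R, using the counts from the summary above:

```r
# Counts of the target variable, taken from summary( churn$churn ) above
churn_counts <- c( yes = 707, no = 4293 )

# Proportion of churners: 707 / 5000
churn_rate <- churn_counts[[ "yes" ]] / sum( churn_counts )
churn_rate   # 0.1414
```

On the full dataset, prop.table( table( churn$churn ) ) gives the same proportions directly.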

1.4 Investigate variable International Plan

Here we first report a contingency table of International Plan (intl.plan) with churn

table( churn $ churn, churn $ intl.plan, dnn = c( "Churn", "International Plan" ) )
       International Plan
  Churn  yes   no
    yes  199  508
    no   274 4019

Here is the above contingency table with margins

addmargins( table( churn $ churn, churn $ intl.plan, dnn = c( "Churn", "International Plan" ) ) )
       International Plan
  Churn  yes   no  Sum
    yes  199  508  707
    no   274 4019 4293
    Sum  473 4527 5000
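The proportions shown by the standardised bar chart in this section can be derived from the contingency table. A base-R sketch, using the counts copied from the table above:

```r
# Contingency table counts copied from the addmargins() output above
# (rows: churn yes/no, columns: international plan yes/no)
tab <- matrix( c( 199, 274, 508, 4019 ), nrow = 2,
               dimnames = list( Churn = c( "yes", "no" ),
                                Intl.Plan = c( "yes", "no" ) ) )

# Column-wise proportions: churn rate within each intl.plan group
round( prop.table( tab, margin = 2 ), 3 )
#      Intl.Plan
# Churn   yes    no
#   yes 0.421 0.112
#   no  0.579 0.888
```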

Bar chart for International Plan

ggplot( data = churn ) + 
  geom_bar( aes( x = intl.plan, fill = churn ) ) +
  scale_fill_manual( values = c( "red", "blue" ) ) 

ggplot( data = churn ) + 
  geom_bar( aes( x = intl.plan, fill = churn ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) ) 

What would be your interpretation of the above plots?

Answer

The first plot depicts a bar chart of intl.plan, with churn overlay. The chart shows that approximately 500 customers have an international plan, while the rest do not. The proportion of churners is not clearly readable, but roughly half of the customers who have an international plan churn, while the churn rate is much smaller for customers who do not have an international plan.

The second chart is the standardised version of the first one. Its y-axis ranges between 0 and 1, and it shows the proportion of churners in each intl.plan category. This chart allows for more precision: we read that approximately 42% of international plan holders churn, while this figure is around 11% for those without an international plan.

In conclusion, the churn rate for international plan holders is about four times higher than for non-subscribers. This variable likely carries explanatory power, so it would not come as a surprise if a data mining algorithm used intl.plan in the prediction model.

1.5 Investigate variable “voice mail plan”

Make a table for counts of Churn and Voice Mail Plan

addmargins(table( churn $ churn, churn $ voice.plan, dnn = c( "Churn", "Voice Mail Plan" ) ))
       Voice Mail Plan
  Churn  yes   no  Sum
    yes  102  605  707
    no  1221 3072 4293
    Sum 1323 3677 5000
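As for the international plan, the churn rate within each voice mail plan group can be computed from this table. A base-R sketch, using the counts copied from the table above:

```r
# Counts copied from the addmargins() output above
# (rows: churn yes/no, columns: voice mail plan yes/no)
tab <- matrix( c( 102, 1221, 605, 3072 ), nrow = 2,
               dimnames = list( Churn = c( "yes", "no" ),
                                Voice.Plan = c( "yes", "no" ) ) )

# Churn rate within each voice.plan group
round( prop.table( tab, margin = 2 ), 3 )
#      Voice.Plan
# Churn   yes    no
#   yes 0.077 0.165
#   no  0.923 0.835
```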

Bar chart for Voice Mail Plan

ggplot( data = churn ) + 
  geom_bar( aes( x = voice.plan, fill = churn ) ) +
  scale_fill_manual( values = c( "red", "blue" ) ) 

ggplot( data = churn ) + 
  geom_bar( aes( x = voice.plan, fill = churn ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) ) 

What would be your interpretation of the above plots?

Answer

The first plot depicts a bar chart for voice.plan, with churn overlay, grouped by voice.plan categories. Voice mail plan holders account for about 1300 customers, while non-holders account for about 3700 customers.

The second plot is a standardised bar chart, and shows that approximately 8% of voice mail plan holders churn. For those who did not opt for the voice mail plan, the churn rate is approximately twice as high, at 16%.

In conclusion, the churn rate is higher among customers who do not have a voice mail plan than among those who do. voice.plan may explain part of the 14.14% churn rate, and hence we expect that a data mining algorithm would include it in the prediction model.

1.6 Investigate variable “customer service calls”

Here, we are interested in investigating the relationship between the variable “customer service calls” and the target variable “churn”. First, we report a bar chart of the variable “customer service calls” using the ggplot function as follows:

ggplot( data = churn ) +
  geom_bar( aes( x = factor( customer.calls ) ) ) 

To see the relationship between the variable “customer service calls” and the target variable “churn”, we report the bar chart of the variable “customer service calls” with a “churn” overlay as follows:

ggplot( data = churn ) +
  geom_bar( aes( x = factor( customer.calls ), fill = churn ), position = "stack" ) +
  scale_fill_manual( values = c( "red", "blue" ) ) 

We also report the normalized bar chart of the variable “customer service calls” with a “churn” overlay as follows:

ggplot( data = churn ) +
  geom_bar( aes( x = factor( customer.calls ), fill = churn ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) ) 

What would be your interpretation of the above plots?

Answer

The first plot is a regular bar chart of customer.calls, and it shows a positively skewed distribution, with a mean of 1.57 calls. The skewness may be explained by the fact that the majority of customers who contact the customer service desk have their issue resolved within one call, and hence there is no need for further calls.

The second plot is similar to the first chart; however, it adds a churn overlay. The overlay shows, for each customer.calls bar, how many customers churned versus did not. Since it is challenging to read proportions from this chart, we turn to the standardised chart, which is the last plot of this section.

The standardised chart depicts the proportions of churners versus non-churners for each customer call count. We observe that the churn rate is nearly constant at around 11% for customers who made 3 or fewer customer service calls. The churn rate, however, at least quadruples for those contacting the customer service desk at least 4 times. It seems that customers' tolerance or satisfaction drops significantly as soon as they need to call customer service at least 4 times.
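The per-group churn rates read off the standardised chart can also be reported numerically. A sketch, assuming the churn data from liver is loaded:

```r
# Churn proportion for each value of customer.calls (column-wise proportions)
round( prop.table( table( churn$churn, churn$customer.calls ), margin = 2 ), 2 )
```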

It is worth noting that the standardised and non-standardised plots must be used together, otherwise the observations may be misleading. For example, had we not taken into account the non-standardised bar chart, we would have concluded that customers calling the service desk 9 times churn with certainty. However, for this observation the data are not representative because of the low sample size; there are only a few customers with 9 customer service calls.

Given the strong graphical evidence of the predictive importance of customer.calls, it is expected that a data mining algorithm will include customer.calls in the model.

1.7 Investigate variable “Day Minutes”

Here, we are interested in investigating the relationship between the variable Day Minutes and the target variable Churn. First, we report the “normalized” histogram of Day Minutes with a Churn overlay:

ggplot( data = churn ) +
  geom_histogram( aes( x = day.mins, fill = churn ), position = "fill", binwidth = 25, color="white" ) +
  scale_fill_manual( values = c( "red", "blue" ) ) 

Another way to see the relationship between variable Day Minutes and the target variable churn, would be by using the boxplot as follows

ggplot( data = churn ) +
  geom_boxplot( aes( x = churn, y = day.mins ), fill = c( "red", "blue" ) ) 

What would be your interpretation of the above boxplot?

Answer

The above boxplots depict the locality, spread and skewness of day.mins grouped by churn categories.

Comparison of location: For both churn categories, that is, for churners and non-churners, the distribution is approximately symmetric. This means that the mean and median of each distribution are approximately at the same location. Knowing this, we can state that churners spend on average approximately 215 minutes a day on calls, while non-churners spend on average around 190 minutes a day.
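The group averages quoted above can be verified directly. A sketch, assuming the churn data from liver is loaded:

```r
# Mean and median of day.mins per churn group
tapply( churn$day.mins, churn$churn, mean )
tapply( churn$day.mins, churn$churn, median )
```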

To allow for a more precise analysis, we plot the data as a density plot; the code below does that.

ggplot( data = churn) + 
  geom_density( aes( x = day.mins, fill = churn ), alpha = 0.3)

We observe that the density for churners is bimodal, with modes at around 160 minutes and 270 minutes. That is, daily call lengths of around 160 and 270 minutes are the most common among churners. Furthermore, we can read that the churn rate increases as day.mins exceeds 200 minutes.

Comparison of dispersion: From the boxplots we read that the interquartile range for churners is approximately twice as large as that for non-churners. Additionally, the whiskers show that the range of day.mins for churners is wider than for non-churners.

Comparison of skewness: day.mins for churners is slightly skewed to the left; combined with the higher location, this indicates that churners spend more time on average on phone calls than non-churners.

Comparison of potential outliers: For churners, the whiskers span from the minimum to the maximum of the data, and hence no outliers or unusual values are flagged. The same cannot be said for non-churners, whose boxplot reports potential outliers beyond the lower and upper whiskers.

day.mins seems to hold relevant information, and therefore we should expect a data mining algorithm to select this variable into the model.

1.8 Investigate variable “International Calls”

Here, we are interested in investigating the relationship between the variable International Calls and the target variable churn. First, we report a bar chart of the variable International Calls as follows:

ggplot( data = churn ) +
  geom_bar( aes( x = intl.calls ) ) 

To see the relationship between the variable International Calls and the target variable churn, we report the bar chart of International Calls with a Churn overlay as follows:

ggplot( data = churn ) +
  geom_bar( aes( x = intl.calls, fill = churn ), position = "stack" ) +
  scale_fill_manual( values = c( "red", "blue" ) ) 

We also report the normalized bar chart of International Calls with a Churn overlay as follows:

ggplot( data = churn ) +
  geom_bar( aes( x = intl.calls, fill = churn ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) ) 

To see the relationship between the variable International Calls and the target variable churn, we report the boxplot as follows:

ggplot( data = churn ) +
  geom_boxplot( aes( x = churn, y = intl.calls ), fill = c( "red", "blue" ) ) 

What would be your interpretation of the above boxplot?

Answer

The above boxplots depict the locality, spread and skewness of intl.calls grouped by churn categories.

Comparison of location: Based on the boxplots, we see that the median count of international calls made by churners and non-churners is approximately the same.

Comparison of dispersion: The interquartile ranges for churners and non-churners are reasonably similar, as is the overall range of the data, as seen from the whisker lengths.

Comparison of skewness: Both groups are positively skewed, but more so for churners than for non-churners. This means that churners tend to make fewer international calls than non-churners.

Comparison of potential outliers: Both groups contain outliers beyond the upper whisker; however, no definitive conclusion can be drawn from this observation.

As a general conclusion, the above plots do not indicate strong graphical evidence of the predictive importance of international calls. Therefore, it may be that the data mining algorithm would not include it in the model.

1.9 Detect Correlated Variables

To visualize the correlation matrix between the variables intl.mins, intl.calls, intl.charge, day.mins, day.calls, day.charge, eve.mins, eve.calls, eve.charge, night.mins, night.calls, and night.charge, we can use the ggcorr function as follows:

variable_list = c( "intl.mins",  "intl.calls",  "intl.charge", 
                   "day.mins",   "day.calls",   "day.charge",
                   "eve.mins",   "eve.calls",   "eve.charge",
                   "night.mins", "night.calls", "night.charge" )

ggcorr( data = churn[ , variable_list ], label = TRUE ) 

pairs.panels( churn[ , c( "intl.mins", "intl.calls", "intl.charge" ) ] ) 

pairs.panels( churn[ , c( "day.mins", "day.calls", "day.charge" ) ] ) 

pairs.panels( churn[ , c( "eve.mins", "eve.calls", "eve.charge" ) ] ) 

pairs.panels( churn[ , c( "night.mins", "night.calls", "night.charge" ) ] ) 

What would be your interpretation of the above correlation matrix plots?

Answer

From the above correlation matrix we see that there are 4 problematic variable pairs that show perfect correlation (r = 1.0).

Namely, night.charge is perfectly linearly correlated with night.mins, which does not come as a surprise, since the former is a function of the latter. The same holds for eve.mins and eve.charge, day.mins and day.charge, and lastly intl.mins and intl.charge.
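These deterministic mins-to-charge relationships can be confirmed numerically. A sketch, assuming the churn data from liver is loaded:

```r
# Pearson correlation between minutes and the corresponding charge variables;
# the correlation matrix above suggests values at (or extremely close to) 1
cor( churn$day.mins,   churn$day.charge )
cor( churn$night.mins, churn$night.charge )
```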

Using perfectly correlated variables overemphasizes one data component, and at worst can make the model unstable and deliver unreliable results. Therefore we should retain only one variable from each correlated pair. Furthermore, including highly correlated variables leads to multicollinearity, which results in unreliable regression coefficients.

2 Exploratory Data Analysis for Bank direct marketing dataset (70 points)

In this part, we want to use Exploratory Data Analysis to explore the bank dataset that is available in the R package liver. You could find more information about the bank dataset at the following link on pages 4-5: manual of the liver package; Or here.

2.1 Business Understanding

The goal is to find the best strategies for the next marketing campaign. How can the financial institution make future marketing campaigns more effective? To make a data-driven decision, we need to analyze the last marketing campaign the bank performed and identify the patterns that will help us draw conclusions for developing future strategies.

2.1.1 Bank direct marketing info

Two main approaches for enterprises to promote products/services are:

  • mass campaigns: targeting the general, indiscriminate public,
  • directed marketing: targeting a specific set of contacts.

In general, positive responses to mass campaigns are typically very low (less than 1%). Direct marketing, on the other hand, focuses on targets that are keener on the specific product/service, making this kind of campaign more effective. However, direct marketing has some drawbacks; for instance, it may trigger a negative attitude towards banks due to the intrusion of privacy.

Banks are interested in increasing financial assets. One strategy is to offer attractive long-term deposit products with good interest rates, in particular by using directed marketing campaigns. At the same time, banks are under pressure to reduce costs and time. Thus, there is a need for improved efficiency: fewer contacts should be made, while keeping approximately the same number of successes (clients subscribing to the deposit).

2.1.2 What is a Term Deposit?

A Term Deposit is a deposit that a bank or a financial institution offers with a fixed rate (often better than just opening a deposit account), in which your money will be returned at a specific maturity time. For more information regarding Term Deposits, please check here.

2.2 Data Understanding

The bank dataset is related to direct marketing campaigns of a Portuguese banking institution. You can find more information related to this dataset at: https://rdrr.io/cran/liver/man/bank.html

The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required to assess whether the product (bank term deposit) would be subscribed or not. The classification goal is to predict if the client will subscribe to a term deposit (variable deposit).

We import the bank dataset:

data( bank )      

We can see the structure of the dataset by using the str function:

str( bank )
  'data.frame': 4521 obs. of  17 variables:
   $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
   $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
   $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
   $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
   $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
   $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
   $ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
   $ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
   $ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
   $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
   $ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
   $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
   $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
   $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
   $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
   $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
   $ deposit  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

It shows that the bank dataset, as a data.frame, has 4521 observations and 17 variables. The dataset has 16 predictors along with the target variable deposit, which is a binary variable with the 2 levels “yes” and “no”. The variables in this dataset are:

  • age: numeric.
  • job: type of job; categorical: “admin.”, “unknown”, “unemployed”, “management”, “housemaid”, “entrepreneur”, “student”, “blue-collar”, “self-employed”, “retired”, “technician”, “services”.
  • marital: marital status; categorical: “married”, “divorced”, “single”; note: “divorced” means divorced or widowed.
  • education: categorical: “secondary”, “primary”, “tertiary”, “unknown”.
  • default: has credit in default?; binary: “yes”,“no”.
  • balance: average yearly balance, in euros; numeric.
  • housing: has housing loan? binary: “yes”, “no”.
  • loan: has personal loan? binary: “yes”, “no”.

Related with the last contact of the current campaign:

  • contact: contact communication type; categorical: “unknown”, “telephone”, “cellular”.
  • day: last contact day of the month; numeric.
  • month: last contact month of year; categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”.
  • duration: last contact duration, in seconds; numeric.

Other attributes:

  • campaign: number of contacts performed during this campaign and for this client; numeric, includes last contact.
  • pdays: number of days that passed by after the client was last contacted from a previous campaign; numeric, -1 means client was not previously contacted.
  • previous: number of contacts performed before this campaign and for this client; numeric.
  • poutcome: outcome of the previous marketing campaign; categorical: “success”, “failure”, “unknown”, “other”.

Target variable:

  • deposit: Indicator of whether the client subscribed to a term deposit; binary: “yes” or “no”.

Following Part 1, first, report the summary of the dataset then apply the Exploratory Data Analysis.

Answer
Summary of the bank dataset


The code below reports descriptive statistics for the bank dataset.

summary(bank)
        age                 job          marital         education    default   
   Min.   :19.00   management :969   divorced: 528   primary  : 678   no :4445  
   1st Qu.:33.00   blue-collar:946   married :2797   secondary:2306   yes:  76  
   Median :39.00   technician :768   single  :1196   tertiary :1350             
   Mean   :41.17   admin.     :478                   unknown  : 187             
   3rd Qu.:49.00   services   :417                                              
   Max.   :87.00   retired    :230                                              
                   (Other)    :713                                              
      balance      housing     loan           contact          day       
   Min.   :-3313   no :1962   no :3830   cellular :2896   Min.   : 1.00  
   1st Qu.:   69   yes:2559   yes: 691   telephone: 301   1st Qu.: 9.00  
   Median :  444                         unknown  :1324   Median :16.00  
   Mean   : 1423                                          Mean   :15.92  
   3rd Qu.: 1480                                          3rd Qu.:21.00  
   Max.   :71188                                          Max.   :31.00  
                                                                         
       month         duration       campaign          pdays       
   may    :1398   Min.   :   4   Min.   : 1.000   Min.   : -1.00  
   jul    : 706   1st Qu.: 104   1st Qu.: 1.000   1st Qu.: -1.00  
   aug    : 633   Median : 185   Median : 2.000   Median : -1.00  
   jun    : 531   Mean   : 264   Mean   : 2.794   Mean   : 39.77  
   nov    : 389   3rd Qu.: 329   3rd Qu.: 3.000   3rd Qu.: -1.00  
   apr    : 293   Max.   :3025   Max.   :50.000   Max.   :871.00  
   (Other): 571                                                   
      previous          poutcome    deposit   
   Min.   : 0.0000   failure: 490   no :4000  
   1st Qu.: 0.0000   other  : 197   yes: 521  
   Median : 0.0000   success: 129             
   Mean   : 0.5426   unknown:3705             
   3rd Qu.: 0.0000                            
   Max.   :25.0000                            
  
EDA of the target variable, deposit


The variable deposit is the target variable, and an indicator of whether the client subscribed to a term deposit or not.

First, deposit is summarised in a table, and then plotted as a bar chart.

addmargins( table( bank $ deposit, dnn = c( "Deposit" ) ) )
  Deposit
    no  yes  Sum 
  4000  521 4521
ggplot( data = bank ) + 
    geom_bar( aes( x = deposit ), fill = c( "red", "blue" ) ) +
    labs( title = "Bar plot for the target variable 'deposit'" )  

The table and bar chart above report that the total number of customers contacted was 4521, which is the sum of the heights of the two bars. Of the contacted customers, 521 subscribed and 4000 did not. This comes down to an 11.52% campaign success rate.
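
The success rate quoted above can also be computed directly rather than read off the chart; a minimal sketch in base R, assuming the `bank` data is loaded as above:

```r
# Share of each deposit outcome; the "yes" share is the campaign success rate
round( prop.table( table( bank $ deposit ) ), 4 )
# 521 / 4521 is approximately 0.1152, i.e. an 11.52% success rate
```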

EDA of age

age is a discrete numerical variable, representing the age of the customer in the dataset. Below, age is plotted as a bar chart, so that we can observe the age distribution of the customers the campaign targeted.

ggplot( data = bank ) +
  geom_bar( aes( x = factor( age ) ) ) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

We can report that the target group of the campaign was customers between the ages of 23 and 60. However, we notice that the bank targeted the 30-to-40 age range the most.

Next, we plot age as a regular bar chart, with deposit overlay, and as a standardised bar chart with the same overlay.

ggplot( data = bank ) + 
  geom_bar( aes( x = age, fill = deposit ) ) +
  scale_fill_manual( values = c( "red", "blue" ) ) 

ggplot( data = bank ) + 
  geom_bar( aes( x = age, fill = deposit ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) ) 

From the above plots we read that within the target group, i.e. those between 23 and 60 years of age, the outcomes are very similar. Even the mainly targeted 30-to-40 age group did not stand out with a higher subscription rate: on average the subscription rate is approximately 10%, which is only about 1 percentage point away from the overall subscription rate of the campaign.

Outside the target group, the campaign was more successful: under the age of 23 and beyond 60, it shows at least a 37.5% success rate on average. That is, customers younger than 23 or older than 60 are considerably more likely to subscribe.
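
The claim above can be checked numerically by banding age and tabulating the subscription rate per band; a sketch in which `age_band` is a helper column introduced only for this check:

```r
# Split customers into the under-23, 23-60, and over-60 bands discussed above
bank $ age_band = cut( bank $ age, breaks = c( 0, 22, 60, Inf ),
                       labels = c( "under 23", "23-60", "over 60" ) )

# Row-wise proportions: the "yes" column is the subscription rate per band
round( prop.table( table( bank $ age_band, bank $ deposit ), margin = 1 ), 3 )
```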

Next we plot the data as a boxplot, grouped by deposit.

ggplot( data = bank ) +
  geom_boxplot( aes( x = deposit, y = age ), fill = c( "red", "blue" ) ) 

Comparison of location: The central location of the two boxplots is approximately the same, at around 40 years of age. That is, the median age of depositors and non-depositors is roughly equal.

Comparison of dispersion: The interquartile range of non-depositors is smaller than that of depositors, and the same holds for the overall age range. That is, the age range of non-depositors is narrower than that of depositors.

Comparison of skewness: For both groups, the age distribution is positively skewed.

Comparison of potential outliers: Both categories show some outliers; however, non-depositors have about twice as many outliers as depositors.

Based on the above two graphs, we see strong graphical indication of age’s predictive importance. Therefore, we expect the data mining model to include this variable.

EDA of job

job is a categorical variable and describes the type of job the customers have at the time of the campaign.

We first plot job categories as a bar chart with deposit overlay, so that we can identify the specific target groups per occupation, and see their corresponding deposit rate.

ggplot( data = bank ) +
  geom_bar( aes( x = factor( job ), fill = deposit ), position = "stack" ) +
  scale_fill_manual( values = c( "red", "blue" ) ) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

From the above bar chart we can distinguish three occupations that stand out in frequency, each with a total count greater than 750: blue-collar workers, management personnel, and technicians. The second-largest group, with more than 375 counts each, consists of administrative workers and those working in the service sector. The third group consists of all other occupations, each represented by fewer than 250 people.

We standardise the bar chart above and re-plot it below.

ggplot( data = bank ) +
  geom_bar( aes( x = factor( job ), fill = deposit ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) ) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

From the standardised plot we read that the campaign was most effective among the following job categories: retired, students, and those with unknown occupation. This finding coincides with the previous variable, age, where we saw that the subscription rate was high below age 23 and beyond age 60, ranges which tend to consist of students and retirees.

We further note that the three most targeted job groups do not necessarily produce a favourable outcome for the campaign. The subscription rate is low for blue-collar workers at around 5%, 12.5% for management personnel, and around 11% for those working in services. The latter two categories do not deviate significantly from the overall success rate of the campaign.

For the rest of the occupations, we see deposit rates ranging between 6% and 12%. Since we deemed age an important predictor, we claim that there is strong graphical indication of job being a relevant explanatory variable for the target variable deposit. Therefore, we expect job to be included in the data mining model too.
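
The per-occupation deposit rates discussed above can be extracted and ranked; a short sketch on the same `bank` data:

```r
# Subscription rate per job category, sorted from most to least successful
job_rates = prop.table( table( bank $ job, bank $ deposit ), margin = 1 )[ , "yes" ]
sort( round( job_rates, 3 ), decreasing = TRUE )
```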

EDA of marital

The variable marital stands for the marital status of the customer. The attribute can take one of the following statuses: “married”, “single”, or “divorced”, where “divorced” means divorced or widowed.

We first plot marital status with a deposit overlay, and then as a standardised bar chart, to observe whether there are any noticeable patterns.

ggplot( data = bank ) +
  geom_bar( aes( x = marital, fill = deposit ), position = "stack" ) +
  scale_fill_manual( values = c( "red", "blue" ) ) 

ggplot( data = bank ) +
  geom_bar( aes( x = marital, fill = deposit ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) ) 

From the above plots we see that married customers are over-represented in the dataset, with a count of approximately 2800. The dataset contains approximately 1200 singles, while the remaining approximately 500 customers are divorced or widowed.

The standardised plot allows us to read the deposit proportions among marital statuses; the success rate of the campaign is similar across statuses, ranging between 10% and 12%. We further interpret that married customers are somewhat less keen on subscribing than customers with either of the other marital statuses.

Since the variable marital does not show strong graphical indication of predictive importance, it is likely that the data mining algorithm will not include it in the model.

EDA of education

The education variable has four categories; “primary”, “secondary”, “tertiary” and an “unknown”.

  • Primary education or elementary education is typically the first stage of formal education.
  • Secondary education typically takes place after six years of primary education and is followed by higher education, vocational education or employment.
  • Tertiary education refers to all formal post-secondary education, including public and private universities, colleges, technical training institutes, and vocational schools.

We explore education by means of a regular bar chart and a standardised bar chart, both with a deposit overlay.

ggplot( data = bank ) +
  geom_bar( aes( x = education, fill = deposit ), position = "stack" ) +
  scale_fill_manual( values = c( "red", "blue" ) ) 

ggplot( data = bank ) +
  geom_bar( aes( x = education, fill = deposit ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

The above bar charts show that customers with secondary education are represented most in the data set, followed by those with tertiary and primary education.

Upon analysing the standardised chart, we report that those with tertiary education tend to subscribe with higher likelihood than customers with other educational levels. Additionally, we notice that the success rate of the campaign is almost identical for those with primary, secondary, and unknown education.

Since the graphical evidence is not strong, it is difficult to conclude whether or not the variable education is significant enough to include in the model.

EDA of default

The default variable is a binary attribute that stands for whether the customer has defaulted on his/her credit or not.

We explore default through bar charts: first a regular chart with a deposit overlay, and then a standardised bar chart.

ggplot( data = bank ) +
  geom_bar( aes( x = default, fill = deposit ), position = "stack" ) +
  scale_fill_manual( values = c( "red", "blue" ) ) 

ggplot( data = bank ) +
  geom_bar( aes( x = default, fill = deposit ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

We read from the regular bar chart that those who did not default on their credit are heavily over-represented: 4445 customers have no default, and only 76 have defaulted. This is understandable from the standpoint of the bank, since customers with a good credit history tend to be more reliable and have the financial means to subscribe for the deposit.

The standardised chart reveals that there is virtually no difference in the deposit rate between the two groups; the EDA of default is therefore inconclusive. That is, default shows no strong graphical evidence of predictive importance, so it is likely that the data mining algorithm will not include it in the prediction model.

EDA of balance

The balance attribute represents the average yearly balance of the customer’s account.

ggplot( bank ) +
    geom_histogram( mapping = aes( x = balance ))

One’s balance is closely related to one’s income. Since income distributions are generally positively skewed, we expect balance to be similarly shaped.

From the above histogram we can confirm that balance is indeed positively skewed, with a mean balance of €1423. The wide range of balances suggests that the dataset contains unusually high values, i.e., there are likely outliers. Furthermore, we note that balance may also be negative, representing an overdraft, which occurs when more money is withdrawn than is in the current account.
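
The positive skew can also be confirmed numerically, e.g. with the skew() function from the psych package loaded at the start; a positive value indicates right skew:

```r
# Sample skewness of balance; a clearly positive value confirms the right skew
skew( bank $ balance )
```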

Balances around 0 dominate the data, but to see the rest of the distribution in more detail we zoom in by limiting the y-axis to [0, 1000].

ggplot( bank ) + 
  geom_histogram( mapping = aes( x = balance )) +
  coord_cartesian( ylim = c( 0, 1000 ) )

Unfortunately the zoomed-in view does not provide much more information, but it confirms the existence of outliers in the range of €40000 to €70000. To inspect these potential outliers, we plot balance as a boxplot.

ggplot( data = bank ) +
  geom_boxplot( aes( x = balance))

We identify the outliers represented by the dots in the above plot. To draw any meaningful interpretation, we need to handle outliers by means of imputation, and then replot the graph.

# Lower and upper hinges (approx. Q1 and Q3) from the boxplot statistics
q1 = boxplot(bank $ balance)$stats[2, ]
q3 = boxplot(bank $ balance)$stats[4, ]

# Tukey fences: observations beyond 1.5 * IQR from the hinges count as outliers
iqr = q3 - q1
whisker_lower = q1 - 1.5 * iqr
whisker_upper = q3 + 1.5 * iqr

# Replace outliers with NA, then impute them randomly with impute() from Hmisc
bank = mutate( bank, balance = ifelse( balance < whisker_lower | balance > whisker_upper, NA, balance ) ) 

bank $ balance = impute( bank $ balance, 'random' )

ggplot( data = bank) + 
  geom_histogram( aes( x = balance, fill = deposit ), position = "stack") + 
  scale_fill_manual( values = c( "red", "blue" ) ) 

The plot of the imputed data shows symmetry. We also read from the chart that the deposit counts follow the shape of the overall histogram, from which we infer that the deposit rate is approximately constant across balances. To confirm this, we plot it as a standardised histogram.

ggplot( data = bank) + 
  geom_histogram( aes( x = balance, fill = deposit ), position = "fill") + 
  scale_fill_manual( values = c( "red", "blue" ) ) 

The above expectation is confirmed. The subscription rate of the campaign is between 9% and 12% and is approximately constant over the different account balances. Therefore, there is no strong graphical evidence that balance is an important factor in determining deposit, and we do not expect the data mining algorithm to incorporate balance in the model.

EDA of housing

Given that housing is a binary variable indicating whether the customer has a mortgage, we first make a contingency table with the target variable, and then plot housing as a regular bar chart and as a standardised bar chart with a deposit overlay.

addmargins( table( bank $ deposit, bank $ housing, dnn = c( "Deposit", "Housing" ) ) )
         Housing
  Deposit   no  yes  Sum
      no  1661 2339 4000
      yes  301  220  521
      Sum 1962 2559 4521
ggplot( data = bank ) +
  geom_bar( aes( x = housing, fill = deposit ), position = "stack" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

Both the contingency table and the bar chart show that customers with a housing loan form the majority. To judge the proportions of deposit subscribers within each group, we move on to the standardised bar chart.

ggplot( data = bank ) +
  geom_bar( aes( x = housing, fill = deposit ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

We see that approximately 15% of customers without a mortgage subscribe for the deposit deal, while only 8% of mortgage holders do. The difference between the two groups is approximately twofold, which constitutes a substantial difference. Therefore, it is reasonable to assume that the data mining algorithm will incorporate housing into the model.
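
The two rates read from the standardised chart can be verified against the contingency table above:

```r
# Row-wise proportions of the housing-by-deposit table
round( prop.table( table( bank $ housing, bank $ deposit ), margin = 1 ), 3 )
# 301 / 1962 is approximately 0.153 (no mortgage) vs 220 / 2559 approximately 0.086
```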

EDA of loan

Similar to housing, loan is a binary variable as well, therefore we can follow the exact same analysis that we did for housing.

That is, first we report a contingency table, and then plot the data on a bar chart.

addmargins( table( bank $ deposit, bank $ loan, dnn = c( "Deposit", "Loan" ) ) )
         Loan
  Deposit   no  yes  Sum
      no  3352  648 4000
      yes  478   43  521
      Sum 3830  691 4521
ggplot( data = bank ) +
  geom_bar( aes( x = loan, fill = deposit ), position = "stack" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

From the above chart we see that approximately 3800 customers do not have a personal loan, representing a significant majority, while the remaining roughly 700 do.

The proportions are not clear from the regular bar chart; therefore, we look at its standardised equivalent.

ggplot( data = bank ) +
  geom_bar( aes( x = loan, fill = deposit ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

We read that the success rate of the campaign among customers without a personal loan is about 12.5%, while it is only 6% for those with one. In this case the difference between the two groups is more than twofold, hence we expect the data mining algorithm to include loan in the model.

EDA of contact

The contact variable describes the channel through which the customer was contacted. There are three categories, which we interpret as follows: cellular refers to the customer’s mobile phone, and telephone to a landline phone. The third category is unknown, which may cover indirect channels such as brochures, billboards, etc.

It is a categorical variable, hence we can plot it on bar chart.

ggplot( data = bank ) +
  geom_bar( aes( x = contact, fill = deposit ), position = "stack" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

We see that the majority of customers were contacted on their cellular phone or through an unknown channel.

ggplot( data = bank ) +
  geom_bar( aes( x = contact, fill = deposit ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

The standardised chart shows that there is practically no difference between the success rates of cellular and telephone contact; however, the campaign was not successful for the unknown group.

Cellular and telephone performed above the overall success rate of the campaign; the campaign team may therefore be advised to reallocate resources from the unknown contact channel to cellular or telephone in order to raise the subscription rate.

The above plot indicates strong graphical evidence of the predictive importance of the contact channel; therefore, the data mining algorithm may well select contact into the model.

EDA of day

The day variable represents the day of the month on which the customer was last contacted. When plotted as a bar chart, we see the frequency of each last-contact day.

  ggplot( data = bank) +
    geom_bar(aes( x = day, fill = deposit ), position = "stack" ) +
    scale_fill_manual( values = c( "red", "blue" ) )

We see that contacts are more frequent in the first two-thirds of the month than in the last third. We standardise the plot to read the subscription proportions precisely.

  ggplot( data = bank) +
    geom_bar(aes( x = day, fill = deposit ), position = "fill" ) +
    scale_fill_manual( values = c( "red", "blue" ) )

From the standardised plot we see little to no pattern that would provide graphical evidence of day having predictive importance for deposit. A few days stand out in terms of success, but that may very well be random noise.

EDA of month

The month variable indicates the month of the last contact. First we plot it as a regular bar chart to identify the number of contacts made per month.

  ggplot( data = bank) +
    geom_bar(aes( x = month, fill = deposit ), position = "stack" ) +
    scale_fill_manual( values = c( "red", "blue" ) )

We see that five months stand out in terms of the number of contacts made: May, June, July, August, and November. These are the months in which the campaign was most aggressive and reached out to the most contacts. Next we standardise the chart to see the success rate of the campaign per month.

  ggplot( data = bank) +
    geom_bar(aes( x = month, fill = deposit ), position = "fill" ) +
    scale_fill_manual( values = c( "red", "blue" ) )

We see from the above plot that March, September, October, and December were the most successful months, with success rates above 40% on average. February and April follow with somewhat lower rates of roughly 15% to 20%. The plot thus indicates strong graphical evidence of the predictive importance of month.

EDA of duration

The duration variable is the duration of the last contact, expressed in seconds. We plot the data as a histogram with a bin width of 60 seconds, so that each bar represents one minute of duration.

ggplot( bank ) +
    geom_histogram( mapping = aes( x = duration ), binwidth = 60 )

The wide range of the x-axis suggests that the data contains outliers. Furthermore, the distribution of duration is positively skewed, which is expected, since most customers tend to keep promotional/marketing calls as short as possible.

Next we query some descriptive statistics on the variable in question.

summary(bank $ duration)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
        4     104     185     264     329    3025

We can report that the mean duration of the phone calls was 264 seconds. 25% of the calls were kept under 104 seconds, 50% under 185 seconds, and 75% just under 329 seconds.
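
From these quartiles, the boxplot's upper fence used to flag outliers can be worked out by hand:

```r
# Tukey's upper fence: Q3 + 1.5 * IQR, with the quartiles reported above
q1 = 104
q3 = 329
q3 + 1.5 * ( q3 - q1 )   # 329 + 1.5 * 225 = 666.5 seconds
```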

To identify outliers in the data, we plot it as a boxplot.

ggplot( data = bank ) +
  geom_boxplot( aes(y = duration ) ) 

The above boxplot identifies outliers beyond 666.5 seconds; any observation beyond that duration is deemed an outlier. Next, to deal with the outliers, we replace them with NA, impute the data, and finally re-plot the histogram grouped by deposit.

q1 = boxplot(bank $ duration)$stats[2, ]
q3 = boxplot(bank $ duration)$stats[4, ]

iqr = q3 - q1

whisker_lower = q1 - 1.5 * iqr
whisker_upper = q3 + 1.5 * iqr

bank = mutate( bank, duration = ifelse( duration < whisker_lower | duration > whisker_upper, NA, duration ) ) 

bank $ duration = impute( bank $ duration, 'random' )

ggplot( data = bank) + 
    geom_histogram( mapping = aes( x = duration, fill = deposit) , binwidth = 60 ) + 
    scale_fill_manual( values = c( "red", "blue" ) )

It is worth noting that the imputed data kept its positively skewed shape; however, the number of successful contacts per minute follows a roughly symmetric distribution, with a peak at approximately 220 seconds.

Next we standardise the above histogram.

ggplot( data = bank) + 
    geom_histogram( mapping = aes( x = duration, fill = deposit) , position = "fill", binwidth = 60 ) + 
    scale_fill_manual( values = c( "red", "blue" ) )

From the standardised histogram we clearly identify that the longer the duration of the call, the higher the proportion of subscribers for deposit.

Therefore, there is a strong graphical evidence of the predictive importance of duration, so we expect the data mining algorithm to incorporate this variable into the model.

EDA of campaign

The campaign variable stands for the number of contacts performed with each customer during the current campaign.

First, we plot the data as a bar chart.

ggplot( data = bank) +
  geom_bar(aes( x = campaign, fill = deposit ), position = "stack" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

We identify that the number of times customers were contacted is heavily positively skewed. That is, most of the targeted people were contacted only a few times. Furthermore, the width of the x-axis gives a clear indication of outliers. To deal with them, we first plot the data as a boxplot to confirm the existence of outliers, and only then mutate and impute the data and re-plot it.

ggplot( data = bank ) +
  geom_boxplot( aes( x = campaign) ) 

The above boxplot confirms the existence of outliers, represented by the dots beyond the right whisker.

q1 = boxplot(bank $ campaign)$stats[2, ]
q3 = boxplot(bank $ campaign)$stats[4, ]

iqr = q3 - q1

whisker_lower = q1 - 1.5 * iqr
whisker_upper = q3 + 1.5 * iqr

bank = mutate( bank, campaign = ifelse( campaign < whisker_lower | campaign > whisker_upper, NA, campaign ) ) 

bank $ campaign = impute( bank $ campaign, 'random' )

ggplot( data = bank) +
  geom_bar(aes( x = campaign, fill = deposit ), position = "stack" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

We report that observations beyond 6 contact points are eliminated, yet the distribution kept its skewed shape. That is, after the imputation, the data still shows that the majority of customers were contacted only a few times.

Next we standardise the above distribution to report the subscription rate per the number of contact points in the campaign.

ggplot( data = bank) +
  geom_bar(aes( x = campaign, fill = deposit ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

The above plot shows that the subscription rate is approximately constant over the first four contact points, at around 10% on average. More precisely, the first contact point yields a bit over 12.5% success rate, the rate drops to around 10% for the second and third calls, and picks up again to 12.5% for the fourth call.

The major drop comes after the fifth contact point, where the success rate is consistently below 10%.

Although the above graphs offer some insights, they do not indicate strong graphical evidence of the predictive importance of campaign.

EDA of pdays

pdays stands for the number of days since the customer was last contacted in a previous campaign. The value -1 means that the customer was not contacted previously, i.e. the current campaign contact is the first.

First, we report the frequency of each day passed as a bar chart.

  ggplot( data = bank) +
    geom_bar(aes( x = pdays, fill = deposit ), position = "stack" ) +
    scale_fill_manual( values = c( "red", "blue" ) )

We see not only that the values spread out over a wide range along the x-axis, but also that -1 observations are heavily over-represented, meaning that the majority of customers were not contacted previously.

It is advisable to filter out the customers who were not contacted earlier; otherwise we are unable to draw a meaningful conclusion on whether pdays is informative for the outcome of deposit.

We achieve this by plotting pdays only for those who were reached out in the previous campaign.

  ggplot( data = bank) +
    geom_bar( data = subset( bank, pdays != -1 ), aes( x = pdays, fill = deposit ), position = "stack" ) +
    scale_fill_manual( values = c( "red", "blue" ) )

Next, we plot a boxplot for the same observations, to see the outliers.

ggplot( data = bank ) +
  geom_boxplot( data = subset( bank, pdays != -1 ), aes(y = pdays )) 

The boxplot identifies some outliers, which we first replace with NA values, then impute, and finally re-plot as a bar chart.

bank2 = mutate( bank, pdays = ifelse( pdays == -1, NA, pdays ) ) 

q1 = boxplot(bank2 $ pdays)$stats[2, ]
q3 = boxplot(bank2 $ pdays)$stats[4, ]

iqr = q3 - q1

whisker_lower = q1 - 1.5 * iqr
whisker_upper = q3 + 1.5 * iqr

bank2 = mutate( bank2, pdays = ifelse( pdays < whisker_lower | pdays > whisker_upper, NA, pdays ) ) 


bank2 $ pdays = impute( bank2 $ pdays, 'random' )

ggplot( data = bank2) +
  geom_bar( aes( x = pdays, fill = deposit ), position = "stack" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

From the imputed data we see that the majority of customers who were contacted previously were contacted again for a follow-up call 35 to 375 days later.

To see the success rate of each of the pdays, we plot it as a bar chart grouped by deposit.

ggplot( data = bank2) +
  geom_bar(aes( x = pdays, fill = deposit ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

Based on the above standardised graph, we see no strong graphical evidence of the predictive importance of pdays. Therefore, we do not expect the data mining algorithm to include pdays in the model.

EDA of previous

The variable previous stands for the number of contacts performed before the current campaign. First we plot it as a bar chart.

  ggplot( data = bank) +
    geom_bar(aes( x = previous, fill = deposit ), position = "stack" ) +
    scale_fill_manual( values = c( "red", "blue" ) )

We see that the majority of the customers were not contacted previously. This observation coincides with the findings in pdays, where these observations were denoted by the value of -1.

Next, we standardise the data.

  ggplot( data = bank) +
    geom_bar(aes( x = previous, fill = deposit ), position = "fill" ) +
    scale_fill_manual( values = c( "red", "blue" ) )

From the above plot we read that the more often a customer was contacted in the previous campaign (up to about 10 times), the higher, on average, the likelihood of subscribing in the current campaign. There are some valleys in the deposit rate for previous within the range [0, 10], but the overall trend is upward.

In this case, we find strong enough graphical evidence that previous may serve as a good predictor for deposit, and hence expect the data mining algorithm to incorporate previous in the model.

EDA of poutcome

poutcome represents the outcome of the previous campaign, and can take on one of the following four values: “failure”, “success”, “unknown”, “other”.

We first plot it as a bar chart.

  ggplot( data = bank) +
    geom_bar(aes( x = poutcome, fill = deposit ), position = "stack" ) +
    scale_fill_manual( values = c( "red", "blue" ) )

We see that the unknown category has the highest count, which again likely represents the customers that were not contacted previously, and thereby coincides with the pdays and previous data.

Next, we plot poutcome as a standardised bar chart.

  ggplot( data = bank) +
    geom_bar(aes( x = poutcome, fill = deposit ), position = "fill" ) +
    scale_fill_manual( values = c( "red", "blue" ) )

We interpret the above graph as showing that customers who subscribed for the deposit in the previous campaign are more likely to subscribe again in the current one. The plot indicates strong graphical evidence of the predictive importance of poutcome, so we expect the data mining algorithm to include poutcome in the model.

Conclusion

In the above exploratory data analysis we analysed 17 variables, of which we found the following graphically indicative enough to be considered relevant predictors for deposit:

  • age: Customers under age of 23 or above 60 are more likely to subscribe.

  • job: Customers with unknown, student or retired occupation are the most likely to deposit. Next to that, customers with administrative, housemaid, management, or technician occupation are the second most likely to deposit.

  • education: Customers with a tertiary educational background are the most likely to subscribe for the deposit deal.

  • housing: Customers without a mortgage are about twice as likely to subscribe as those with a mortgage.

  • loan: Customers without a personal loan are about twice as likely to subscribe as those with a personal loan.

  • contact: Customers reached via cellular or telephone are more likely to subscribe than those reached by unknown means.

  • month: Customers that were contacted last in March, September, October or December are the most likely to respond positively to the campaign and subscribe.

  • duration: The longer the phone call, the more likely the customer is to subscribe.

  • previous: The more times the customer was contacted in the previous campaign (but less than 10 times), the more likely he/she will subscribe in the current campaign.

  • poutcome: If the customer subscribed for the deal in the previous campaign, it is likely that he/she will do so again in the current campaign.

3 Bonus: Exploratory Data Analysis for your own dataset (30 points)

In this part, you can apply Exploratory Data Analysis to explore your own dataset. You can follow the same steps as in Part 1 (above) of these exercises.